• We are a non-profit Digital Library & Archive founded in 1996 

• 20+PB unique data: 10PB web, ~8m text, 2m vid, 2m aud, 100K soft, etc 

• We work in a former church and it's awesome 

• Developed: Heritrix, Wayback, warcprox, Umbra, NutchWax, ARC format 

• Engineers, librarians/archivists, program staff 








INTERNET ARCHIVE 

Wayback Machine 

• https://archive.org/web 

• Largest and oldest publicly available web 
archive in existence 

• 485,000,000,000+ URLs (that's billions) 

• Like a billion websites, domain agnostic 

• Content in 40+ Languages 

• Periodic snapshot; 1b+ URLs per week 

https://archive-it.org/ 

Web archiving service 
used by 370+ institutions 

3,500+ collections, 10 
billion+ URLs 

49 states and 19 countries 

Libraries, archives, 
museums, governments, 
non-profits, etc. 

User groups, Annual 
Meeting, collaborative 
and educational projects 

What is a web archive? 



• Web archiving is the process of collecting portions of 
web content, preserving the collections, and then 
providing access to the archives - for use and reuse. 

• A web archive is a collection of archived URLs grouped 
by theme, event, subject area, or web address. 

• A web archive contains as much as possible from the 
original resources and documents changes over time. 
It recreates the experience a user would have had if 
they had visited the live site on the day it was archived. 




Web archive community 



WEB ARCHIVING IN THE UNITED 
STATES: A 2013 SURVEY 

AN NDSA REPORT 








NDSA 2013 Survey 

• 70% of respondents using Archive-It 

• 17% were using California Digital 
Library's Web Archiving Service 

• 81% of organizations devoting one half 
FTE or less to web archiving 

IIPC 2013 Survey 

Is your web archiving collection 
integrated in your preservation system? 

37% Yes 

26% Planning to 

37% Have not integrated their web 
collection 

REASONS FOR NOT TRANSFERRING DATA FROM AN EXTERNAL SERVICE 

• Building our in-house infrastructure 

• No place to store 

• Not sure what we would do with it 



Format Obsolescence: the David Rosenthal perspective 

The vast majority of information generated today will not survive 
100 years for reasons that have nothing to do with the 
interpretability of the bits. 



WARC (Web ARChive) Format 

ISO 28500:2009 

Combines multiple 
digital resources into 
an aggregate archival 
file together with 
related information 

Container file 

Written by crawlers 

Concatenated raw 
content 

For long-term 
storage and 
preservation 

WARC: the What and What Not 



The What 

• Four required fields: 

- Record Identifier (URI) 

- Content Length / Record 
Body Size 

- Timestamp 

- WARC Record Type: 8 
different types but most 
common is the archived 
response/resource (HTML, 
pdf, JavaScript...) 

• WARCs contain extensive 
technical metadata 
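
As a rough illustration of the four required fields, the sketch below parses the text header of a single WARC record and reports which mandatory named fields are missing. The function names `parse_warc_header` and `missing_required_fields` are hypothetical, and the sample header is a deliberately incomplete, made-up example; a production workflow would more likely rely on an established WARC library.

```python
# The four mandatory named fields of a WARC record header.
REQUIRED_FIELDS = ("WARC-Record-ID", "Content-Length", "WARC-Date", "WARC-Type")


def parse_warc_header(header_text):
    """Split a WARC text header into (version line, {field name: value})."""
    lines = [line for line in header_text.strip().splitlines() if line.strip()]
    version = lines[0].strip()  # e.g. "WARC/1.0"
    fields = {}
    for line in lines[1:]:
        # Split on the first colon only, so timestamps and URIs stay intact.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields


def missing_required_fields(fields):
    """Return the required field names that are absent from a parsed header."""
    present = {name.lower() for name in fields}
    return [name for name in REQUIRED_FIELDS if name.lower() not in present]


# Hypothetical header that lacks a record identifier.
sample = """WARC/1.0
WARC-Type: resource
WARC-Date: 2006-09-19T17:20:24Z
Content-Type: image/jpeg
Content-Length: 1662
"""

version, fields = parse_warc_header(sample)
print(version, missing_required_fields(fields))
```

Field names are compared case-insensitively here, which is a cautious choice for a checker that may see output from different crawlers.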
The What Not 

• Rights and permissions 

• Descriptions 

• Agents & Events 

• File format identification 

• Validation 

• Characterization 

WARC: The Guts 

The eight types of WARC records: 

• warcinfo - defines records that follow 

• response - scheme-specific response (full http response) 

• resource - direct retrieval w/o protocol 

• request - full http request w/ headers 

• metadata - further describe/explain harvested resource 
(hopsFromSeed, fetchTime) 

• revisit - revisitation of previously archived content (dedupe) 

• conversion - transformations 

• continuation - completion across segmentation 
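
To make the record types concrete, here is a minimal sketch that walks an uncompressed WARC byte stream and tallies records by `WARC-Type`. It assumes well-formed records and no compression (real WARC files are commonly gzipped, often record by record); `iter_warc_records` and `make_record` are hypothetical helpers, and the four records are synthetic test data, not crawl output.

```python
import io
from collections import Counter


def iter_warc_records(stream):
    """Yield (headers, block) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # an empty line ends the text header
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The content block is read by length, never by scanning for markers.
        block = stream.read(int(headers["content-length"]))
        yield headers, block


def make_record(record_type, block):
    """Build a tiny synthetic record (illustrative only, not a complete header)."""
    header = (
        "WARC/1.0\r\n"
        "WARC-Type: %s\r\n"
        "Content-Length: %d\r\n"
        "\r\n" % (record_type, len(block))
    )
    return header.encode("utf-8") + block + b"\r\n\r\n"


data = b"".join([
    make_record("warcinfo", b"software: example"),
    make_record("request", b"GET / HTTP/1.1\r\n\r\n"),
    make_record("response", b"HTTP/1.1 200 OK\r\n\r\nhello"),
    make_record("revisit", b""),
])

counts = Counter(h["warc-type"] for h, _ in iter_warc_records(io.BytesIO(data)))
print(counts)
```

Reading each block by its declared `Content-Length` is what lets arbitrary binary payloads sit inside the concatenated file without confusing the parser.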
WARC file 

WARC record 

Text header: 

WARC/1.0 
WARC-Type: resource 
WARC-Target-URI: file://var/www/htdoc/images/logo.jpg 
WARC-Date: 2006-09-19T17:20:24Z 
WARC-Record-ID: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0> 
Content-Type: image/jpeg 
WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2 
WARC-Block-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2 
Content-Length: 1662 

Content block: 

[image/jpeg binary data] 

... etc. 
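
The digest fields in the record header above are an algorithm label followed by an encoded hash. The sketch below shows one way such a value could be recomputed for fixity checking, assuming the common crawler convention of a Base32-encoded SHA-1; `warc_sha1_digest` and `digest_matches` are hypothetical names. The example record's digest cannot be verified here because the slide does not include the image bytes.

```python
import base64
import hashlib


def warc_sha1_digest(data):
    """Return 'sha1:' followed by the Base32-encoded SHA-1 of the given bytes."""
    encoded = base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")
    return "sha1:" + encoded


def digest_matches(data, header_value):
    """Compare a recomputed digest against a digest header value."""
    algorithm, _, value = header_value.strip().partition(":")
    if algorithm.lower() != "sha1":
        raise ValueError("only sha1 is handled in this sketch")
    return warc_sha1_digest(data) == "sha1:" + value.strip().upper()


payload = b"example payload bytes"  # stand-in for a record's payload
stored = warc_sha1_digest(payload)  # what a crawler might have written
print(stored)
print(digest_matches(payload, stored))
print(digest_matches(payload + b"!", stored))
```

For a `resource` record the block and the payload are the same bytes, so both digests agree; for a `response` record the block digest also covers the HTTP headers.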



https://wiki.archivematica.org/Significant_characteristics_of_websites 





ARCHIVEIT-3336-DAILY-26569-20141107000046608-00000-wbgrp-crawl051.us.archive.org-6442.warc 

WARC/1.0 
WARC-Type: response 
WARC-Target-URI: http://sfbay.craigslist.org/sfc/roo/4727260590.html 
WARC-Date: 2014-11-07T00:00:56Z 
WARC-Payload-Digest: sha1:XLKPTY4QQOF24LEBJYDMUIINKENXU5AO 
WARC-IP-Address: 208.82.238.146 
WARC-Record-ID: <urn:uuid:6ef7262f-515b-4f3d-8c28-2efl0b28b9161> 
Content-Type: application/http; msgtype=response 
Content-Length: 13253 

HTTP/1.1 200 OK 
Connection: close 
Cache-Control: max-age=300, public 
Last-Modified: Fri, 07 Nov 2014 00:00:56 GMT 
Date: Fri, 07 Nov 2014 00:00:56 GMT 
Vary: Accept-Encoding 
Content-Type: text/html; charset=UTF-8 
X-MCP-Cache-Control: max-age=2592000, public 
X-Frame-Options: SAMEORIGIN 
Server: Apache 
Expires: Fri, 07 Nov 2014 00:05:56 GMT 

<!DOCTYPE html> 
<html class="no-js"> 
<head> 
<title>master bedroom with a pirate bathroom in a 3b/2b apartment. $1520/m</title> 
<meta name="robots" content="NOARCHIVE,NOFOLLOW"> 
<meta name="description" content="Hi there, We are looking for one new roommate to fill our 3b/2b apartment at 320 Capp St. The 
room is available on Nov. 9th-ish. If you're interested, please reply with a bit about yourself and your..."> 
<meta name="twitter:card" content="preview"> 
<meta property="og:description" content="Hi there, We are looking for one new roommate to fill our 3b/2b apartment at 320 Capp 
St. The room is available on Nov. 9th-ish. If you're interested, please reply with a bit about yourself and your..."> 
<meta property="og:image" content="http://images.craigslist.org/00p0p_l3IbaeevjXx_600x450.jpg"> 
<meta property="og:site_name" content="craigslist"> 
<meta property="og:title" content="master bedroom with a pirate bathroom in a 3b/2b apartment. $1520/m"> 
<meta property="og:type" content="article"> 
<meta property="og:url" content="http://sfbay.craigslist.org/sfc/roo/4727260590.html"> 
<meta name="viewport" content="initial-scale=1.0, user-scalable=1"> 
<link type="text/css" rel="stylesheet" media="all" href="//www.craigslist.org/styles/cl.css?v=d7d4e51d9bac1e78a31bb15f752864cf"> 
<link type="text/css" rel="stylesheet" media="all" href="//www.craigslist.org/styles/leaflet-stock.css?v=ea2cbe352bcad5eae1c267a9dd15a5c1"> 
<!--[if lt IE 9]> 
<script src="//www.craigslist.org/js/html5shiv.min.js?v=42e8031bc7ca9d67a48f4a5feff7bf29" type="text/javascript"></script> 
<![endif]--> 
<!--[if lte IE 7]> 
<script src="//www.craigslist.org/js/json2.min.js?v=178d4ad319e0e0b4a451b15e49b71bec" type="text/javascript"></script> 
<![endif]--> 
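
In a response record like the one above, the content block holds the full HTTP response: status line, HTTP headers, and then the payload. A minimal sketch of separating those parts follows; `split_http_response` is a hypothetical helper, the sample block is a shortened stand-in for the record shown, and transfer or content encodings are ignored.

```python
def split_http_response(block):
    """Split an application/http response block into status, headers, payload."""
    # The first empty line separates the HTTP headers from the payload.
    head, _, payload = block.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, payload


# Shortened stand-in for the content block of a response record.
block = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"Server: Apache\r\n"
    b"\r\n"
    b"<!DOCTYPE html>\n<html class=\"no-js\">"
)

status_line, headers, payload = split_http_response(block)
print(status_line)
print(headers["content-type"])
print(payload[:15])
```

The `Content-Type` recovered this way is only what the server claimed at capture time, so it is a starting point for format identification rather than a verified result.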












Challenges to Preservation Metadata 

• Concatenated nature 

• Unpack every resource? 

• Dispersed for storage 

• Arbitrary placement of 
resources in WARC files 

• Duplication / revisit 

• Unreliable MIME types + format 
verification/obsolescence 

• Differentiated preservation actions 

• Volume of data 

AIT 2015 Partner Survey 

• 80% of respondents do not currently store 
local copies of their WARC files 

- 53% plan on doing so in the future 

- 41% are considering this for the future 

• 20% ingest their WARCs into a digital 
preservation system or long-term repository 

• 14% create metadata for WARC files 

Figure 1: The PREMIS Data Model 



The "Onion" Model 




objectCharacteristics 
compositionLevel 
Logical distinctions? 

"The individual filestream objects are 
not composition levels of the package 
file object. They should be considered 
separate objects , each with their own 
composition levels. " 

WARC 

• Record 

• Record type 
• Header 

• Content block 
• Payload 

• Bitstream 

• On and on 




Objects 



Harvest instance 
ARC file 

hasHarvestInstance 



Agents (type/role) 




Manager of the event (person/manager) 
Crawler (software/performer) 

Crawling Institution (organization/performer) 
Crawling server (software/issuer) 

User Agent (software/user) 

Job (parameters/trigger) 



Harvest Event 




Event outcome detail extensions: 
ARC files list, hosts report 
Event outcome: policy towards the 
robots.txt protocol 



Clement Oury, Sebastien Peyrard. From the World Wide Web to digital library stacks (2011) 



<structMap> & <admSec> 



<mets:structMap TYPE="logical"> 
  <!-- the website containing webpages --> 
  <mets:div TYPE="WEBSITE"> 
    <!-- the first webpage --> 
    <mets:div TYPE="WEBPAGE"/> 
    <!-- definition of image --> 
    <mets:div TYPE="ASSOCIATEDOBJECT"/> 
    <!-- the second webpage --> 
    <mets:div TYPE="WEBPAGE"/> 
  </mets:div> 
</mets:structMap> 



Figure 2. <structMap> in METS representing the logical 
structure of a harvested website 



<premis:event> 
  <premis:eventIdentifier> 
    <premis:eventIdentifierType>local</premis:eventIdentifierType> 
    <premis:eventIdentifierValue>event01</premis:eventIdentifierValue> 
  </premis:eventIdentifier> 
  <premis:eventType>migration</premis:eventType> 
  <premis:eventDateTime>2006-07-16T19:20:30</premis:eventDateTime> 
  <premis:linkingAgentIdentifier> 
    <premis:linkingAgentIdentifierType>local</premis:linkingAgentIdentifierType> 
    <premis:linkingAgentIdentifierValue>agent001</premis:linkingAgentIdentifierValue> 
  </premis:linkingAgentIdentifier> 
</premis:event> 



Figure 5. Representation of an event in PREMIS 
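
An event of the shape shown in Figure 5 could be generated programmatically rather than written by hand. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; it assumes the PREMIS 2.x namespace URI, and `build_event` and `q` are hypothetical helper names. The element values simply mirror the figure.

```python
import xml.etree.ElementTree as ET

# Assumption: PREMIS 2.x namespace; adjust to the schema version actually in use.
PREMIS_NS = "info:lc/xmlschema/premis-v2"
ET.register_namespace("premis", PREMIS_NS)


def q(tag):
    """Qualify a tag name with the PREMIS namespace."""
    return "{%s}%s" % (PREMIS_NS, tag)


def build_event(event_id, event_type, date_time, agent_id):
    """Build a premis:event element linked to a single agent."""
    event = ET.Element(q("event"))
    ident = ET.SubElement(event, q("eventIdentifier"))
    ET.SubElement(ident, q("eventIdentifierType")).text = "local"
    ET.SubElement(ident, q("eventIdentifierValue")).text = event_id
    ET.SubElement(event, q("eventType")).text = event_type
    ET.SubElement(event, q("eventDateTime")).text = date_time
    agent = ET.SubElement(event, q("linkingAgentIdentifier"))
    ET.SubElement(agent, q("linkingAgentIdentifierType")).text = "local"
    ET.SubElement(agent, q("linkingAgentIdentifierValue")).text = agent_id
    return event


event = build_event("event01", "migration", "2006-07-16T19:20:30", "agent001")
print(ET.tostring(event, encoding="unicode"))
```

Generating one such event per crawl or per WARC file, rather than per archived URL, is one way to keep the metadata volume manageable at web-archive scale.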



Marcus Enders. A METS based information package for long term accessibility of Web Archives. (2010) 







Premises that complicate PREMIS 

• Little local acquisition 

• Format opacity 

• Concatenation / compression 

• Crawler variance 

• Policies / Agents 

• Scale, scale, scale 




"Practical" Approaches 



• Data redundancy over metadata 
granularity 

• Utilize Crawl/Crawler-specific resources 

• mimetype-report.txt 

• crawl-report.txt 

• CDX index 

• Utilize additional crawl reporting 

• Host reports, etc 

• Decomposition levels? 

• Simplify events/agents/objects 
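
One way to read "utilize crawl-specific resources" is to derive coarse preservation metadata from the CDX index instead of unpacking every WARC. The sketch below assumes the common 11-column, space-delimited CDX layout; the sample lines, digests, and file names are invented for illustration, and `parse_cdx_line` and `summarize` are hypothetical names.

```python
from collections import Counter

# Assumption: 11-column CDX layout (header line " CDX N b a m s k r M S V g").
CDX_FIELDS = ("urlkey", "timestamp", "original", "mimetype", "statuscode",
              "digest", "redirect", "metatags", "length", "offset", "filename")


def parse_cdx_line(line):
    """Turn one space-delimited CDX line into a dict keyed by column name."""
    values = line.split()
    if len(values) != len(CDX_FIELDS):
        raise ValueError("unexpected number of CDX columns: %d" % len(values))
    return dict(zip(CDX_FIELDS, values))


def summarize(lines):
    """Tally reported MIME types and count captures with repeated digests."""
    mimes = Counter()
    digests = Counter()
    for line in lines:
        if not line.strip() or line.lstrip().startswith("CDX"):
            continue  # skip blank lines and the column header
        row = parse_cdx_line(line)
        mimes[row["mimetype"]] += 1
        digests[row["digest"]] += 1
    repeated = sum(n - 1 for n in digests.values() if n > 1)
    return mimes, repeated


# Invented sample index lines (not real crawl data).
sample = [
    " CDX N b a m s k r M S V g",
    "org,example)/ 20141107000056 http://example.org/ text/html 200 DIGESTAAA - - 1024 0 example-00000.warc.gz",
    "org,example)/logo.jpg 20141107000057 http://example.org/logo.jpg image/jpeg 200 DIGESTBBB - - 2048 1024 example-00000.warc.gz",
    "org,example)/ 20141108000056 http://example.org/ text/html 200 DIGESTAAA - - 1024 3072 example-00001.warc.gz",
]

mimes, repeated = summarize(sample)
print(mimes)
print(repeated)
```

The MIME types in the index are those reported at crawl time, so a summary like this describes the collection cheaply but does not replace format identification.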




Closing/Discussion Thoughts 

• Forecasting Obsolescence 



• Collection vs. Control 

• Institutional vs. Technological 

• Lightweight Tonnage of Data 

THANKS! 



Jefferson Bailey, Internet Archive 
jefferson@archive.org | @jefferson_bail 
Maria LaCalle, Internet Archive 
maria@archive.org 

Internet Archive 
https://archive.org 
Archive-It 

https://archive-it.org 